Prosper loan dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than USD 12 billion in loans to more than 770,000 people. Through Prosper, people can invest in each other in a way that is financially and socially rewarding. Borrowers apply online for a fixed-rate, fixed-term loan between USD 2,000 and USD 40,000. Individuals and institutions can invest in the loans and earn attractive returns. Prosper handles all loan servicing on behalf of the matched borrowers and investors. (source: https://www.prosper.com/about)
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
pd.set_option('display.max_colwidth', None) # to display the whole strings and not collapse them
pd.set_option('display.max_columns', None) # to not collapse columns when viewing data
pd.set_option('display.max_rows', None) # to display all rows in the output
import warnings
warnings.filterwarnings("ignore") # warnings off
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
prosper_df = pd.read_csv('prosperLoanData.csv')
prosper_df
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | CO | Other | Self-employed | 2.0 | True | True | NaN | 2007-08-26 18:41:46.780000000 | 640.0 | 659.0 | 2001-10-11 00:00:00 | 5.0 | 4.0 | 12.0 | 1 | 24.0 | 3.0 | 3.0 | 2.0 | 472.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.00 | 1500.0 | 11.0 | 0.81 | 0.0 | 0.17 | $25,000-49,999 | True | 3083.333333 | E33A3400205839220442E84 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 78 | 19141 | 9425 | 2007-09-12 00:00:00 | Q3 2007 | 1F3E3376408759268057EDA | 330.43 | 11396.1400 | 9425.00 | 1971.1400 | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | 0.07960 | 0.0249 | 0.05470 | 6.0 | A | 7.0 | 2 | CO | Professional | Employed | 44.0 | False | False | NaN | 2014-02-27 08:28:14 | 680.0 | 699.0 | 1996-03-18 00:00:00 | 14.0 | 14.0 | 29.0 | 13 | 389.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3989.0 | 0.21 | 10266.0 | 29.0 | 1.00 | 2.0 | 0.18 | $50,000-74,999 | True | 6125.000000 | 9E3B37071505919926B1D82 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 134815 | 10000 | 2014-03-03 00:00:00 | Q1 2014 | 1D13370546739025387B2F4 | 318.93 | 0.0000 | 0.00 | 0.0000 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | GA | Other | Not available | NaN | False | True | 783C3371218786870A73D20 | 2007-01-02 14:09:10.060000000 | 480.0 | 499.0 | 2002-07-27 00:00:00 | NaN | NaN | 3.0 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.06 | Not displayed | True | 2083.333333 | 6954337960046817851BCB2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 86 | 6466 | 3001 | 2007-01-17 00:00:00 | Q1 2007 | 5F7033715035555618FA612 | 123.32 | 4186.6300 | 3001.00 | 1185.6300 | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | 0.08490 | 0.0249 | 0.06000 | 6.0 | A | 9.0 | 16 | GA | Skilled Labor | Employed | 113.0 | True | False | NaN | 2012-10-22 11:02:32 | 800.0 | 819.0 | 1983-02-28 00:00:00 | 5.0 | 5.0 | 29.0 | 7 | 115.0 | 0.0 | 1.0 | 4.0 | 10056.0 | 14.0 | 0.0 | 0.0 | 1444.0 | 0.04 | 30754.0 | 26.0 | 0.76 | 0.0 | 0.15 | $25,000-49,999 | True | 2875.000000 | A0393664465886295619C51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 16 | 77296 | 10000 | 2012-11-01 00:00:00 | Q4 2012 | 9ADE356069835475068C6D2 | 321.45 | 5143.2000 | 4091.09 | 1052.1100 | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | 0.18316 | 0.0925 | 0.09066 | 3.0 | D | 4.0 | 2 | MN | Executive | Employed | 44.0 | True | False | NaN | 2013-09-14 18:38:44 | 680.0 | 699.0 | 2004-02-20 00:00:00 | 19.0 | 19.0 | 49.0 | 6 | 220.0 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6193.0 | 0.81 | 695.0 | 39.0 | 0.95 | 2.0 | 0.26 | $100,000+ | True | 9583.333333 | A180369302188889200689E | 1.0 | 11.0 | 11.0 | 0.0 | 0.0 | 11000.0 | 9947.90 | NaN | 0 | NaN | 6 | 102670 | 15000 | 2013-09-20 00:00:00 | Q3 2013 | 36CE356043264555721F06C | 563.97 | 2819.8500 | 1563.22 | 1256.6300 | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | 0.16490 | 0.0699 | 0.09500 | 4.0 | C | 5.0 | 1 | IL | Food Service Management | Employed | 246.0 | True | False | NaN | 2013-04-14 05:54:58 | 700.0 | 719.0 | 1997-09-01 00:00:00 | 9.0 | 9.0 | 41.0 | 9 | 209.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 | 0.0 | 7714.0 | 0.80 | 1886.0 | 37.0 | 0.83 | 3.0 | 0.13 | $50,000-74,999 | True | 4333.333333 | 9BD7367919051593140DB62 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 11 | 88485 | 10000 | 2013-04-22 00:00:00 | Q2 2013 | 2EC435768441332602FDC15 | 364.74 | 3647.4000 | 2238.38 | 1409.0200 | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 113933 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | 0.10070 | 0.0200 | 0.08070 | 6.0 | A | 8.0 | 7 | PA | Professional | Employed | 21.0 | True | False | NaN | 2011-11-03 20:42:53 | 700.0 | 719.0 | 1992-01-17 00:00:00 | 14.0 | 13.0 | 39.0 | 9 | 495.0 | 1.0 | 4.0 | 1.0 | 5062.0 | 4.0 | 0.0 | 0.0 | 15743.0 | 0.69 | 6658.0 | 39.0 | 0.92 | 0.0 | 0.11 | $75,000-99,999 | True | 8041.666667 | 62D93634569816897D5A276 | 3.0 | 60.0 | 60.0 | 0.0 | 0.0 | 33501.0 | 4815.42 | -26.0 | 0 | NaN | 28 | 55801 | 2000 | 2011-11-07 00:00:00 | Q4 2011 | 55C4336679182766893E4FC | 65.57 | 2330.5500 | 1997.16 | 333.3900 | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 113934 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | 0.18828 | 0.1025 | 0.08578 | 3.0 | D | 3.0 | 1 | TX | Other | Employed | 84.0 | True | False | NaN | 2013-12-13 05:49:15 | 700.0 | 719.0 | 2002-02-25 00:00:00 | 12.0 | 12.0 | 25.0 | 9 | 521.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22147.0 | 0.73 | 7853.0 | 25.0 | 1.00 | 0.0 | 0.51 | $25,000-49,999 | True | 2875.000000 | DD1A370200396006300ACA0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 123122 | 10000 | 2013-12-23 00:00:00 | Q4 2013 | 0FE0370029359765342FDB5 | 273.35 | 546.7000 | 183.15 | 363.5500 | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 113935 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | 0.24450 | 0.0850 | 0.15950 | 4.0 | C | 5.0 | 2 | GA | Food Service | Full-time | 94.0 | True | False | NaN | 2011-11-14 13:18:24 | 680.0 | 699.0 | 1993-12-01 00:00:00 | 11.0 | 11.0 | 22.0 | 7 | 488.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11956.0 | 0.69 | 4137.0 | 19.0 | 0.80 | 1.0 | 0.48 | $25,000-49,999 | True | 3875.000000 | 589536350469116027ED11B | 1.0 | 16.0 | 16.0 | 0.0 | 0.0 | 5000.0 | 3264.37 | -4.0 | 0 | NaN | 28 | 56401 | 15000 | 2011-11-21 00:00:00 | Q4 2011 | A33834861822272782621C8 | 449.55 | 21122.5600 | 15000.00 | 6122.5600 | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 113936 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | 0.09071 | 0.0299 | 0.06081 | 6.0 | A | 7.0 | 1 | NY | Professor | Employed | 244.0 | False | False | NaN | 2014-01-15 09:27:40 | 680.0 | 699.0 | 1995-01-01 00:00:00 | 10.0 | 9.0 | 44.0 | 8 | 289.0 | 0.0 | 1.0 | 1.0 | 257.0 | 3.0 | 1.0 | 0.0 | 6166.0 | 0.80 | 675.0 | 36.0 | 0.75 | 0.0 | 0.23 | $50,000-74,999 | True | 4583.333333 | 00AF3704550953269A64E40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 2 | 127508 | 2000 | 2014-01-21 00:00:00 | Q1 2014 | CE1E3704648000761C9F724 | 64.90 | 64.3307 | 47.25 | 17.0807 | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
prosper_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
prosper_df.describe(include='all')
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 113937 | 1.139370e+05 | 113937 | 28953 | 113937.000000 | 113937 | 55089 | 113912.000000 | 113937.000000 | 113937.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853 | 84853.000000 | 113937.000000 | 108422 | 110349 | 111682 | 106312.000000 | 113937 | 113937 | 13341 | 113937 | 113346.000000 | 113346.000000 | 113240 | 106333.000000 | 106333.000000 | 113240.000000 | 113937.00000 | 113937.000000 | 113240.000000 | 112778.000000 | 113240.000000 | 106315.000000 | 112947.000000 | 113240.000000 | 106333.000000 | 1.063330e+05 | 106333.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 105383.000000 | 113937 | 113937 | 1.139370e+05 | 113937 | 22085.0000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.00000 | 22085.000000 | 22085.000000 | 18928.000000 | 113937.000000 | 16952.000000 | 113937.000000 | 113937.000000 | 113937.00000 | 113937 | 113937 | 113937 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.00000 | 113937.000000 | 113937.000000 |
| unique | 113066 | NaN | 113064 | 8 | NaN | 12 | 2802 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7 | NaN | NaN | 51 | 67 | 8 | NaN | 2 | 2 | 706 | 112992 | NaN | NaN | 11585 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8 | 2 | NaN | 113066 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1873 | 33 | 90831 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | 17A93590655669644DB4C06 | NaN | 2013-10-02 17:20:16.550000000 | C | NaN | Current | 2014-03-04 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | C | NaN | NaN | CA | Other | Employed | NaN | True | False | 783C3371218786870A73D20 | 2013-12-23 09:38:12 | NaN | NaN | 1993-12-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | $25,000-49,999 | True | NaN | CB1B37030986463208432A1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2014-01-22 00:00:00 | Q4 2013 | 63CA34120866140639431C9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 6 | NaN | 6 | 5649 | NaN | 56576 | 105 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 18345 | NaN | NaN | 14717 | 28617 | 67322 | NaN | 57478 | 101218 | 1140 | 6 | NaN | NaN | 185 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32192 | 105268 | NaN | 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 491 | 14450 | 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 6.278857e+05 | NaN | NaN | 40.830248 | NaN | NaN | 0.218828 | 0.192764 | 0.182701 | 0.168661 | 0.080306 | 0.096068 | 4.072243 | NaN | 5.950067 | 2.774209 | NaN | NaN | NaN | 96.071582 | NaN | NaN | NaN | NaN | 685.567731 | 704.567731 | NaN | 10.317192 | 9.260164 | 26.754539 | 6.96979 | 398.292161 | 1.435085 | 5.584405 | 0.592052 | 984.507059 | 4.154984 | 0.312646 | 0.015094 | 1.759871e+04 | 0.561309 | 11210.225447 | 23.230034 | 0.885897 | 0.802327 | 0.275947 | NaN | NaN | 5.608026e+03 | NaN | 1.4211 | 22.934345 | 22.271949 | 0.613629 | 0.04854 | 8472.311961 | 2930.313906 | -3.223214 | 152.816539 | 16.268464 | 31.896882 | 69444.474271 | 8337.01385 | NaN | NaN | NaN | 272.475783 | 4183.079489 | 3105.536588 | 1077.542901 | -54.725641 | -14.242698 | 700.446342 | 681.420499 | 25.142686 | 0.998584 | 0.048027 | 0.02346 | 16.550751 | 80.475228 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| min | NaN | 4.000000e+00 | NaN | NaN | 12.000000 | NaN | NaN | 0.006530 | 0.000000 | -0.010000 | -0.182700 | 0.004900 | -0.182700 | 1.000000 | NaN | 1.000000 | 0.000000 | NaN | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 19.000000 | NaN | 0.000000 | 0.000000 | 2.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | 0.000000e+00 | NaN | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | -209.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1000.00000 | NaN | NaN | NaN | 0.000000 | -2.349900 | 0.000000 | -2.349900 | -664.870000 | -9274.750000 | -94.200000 | -954.550000 | 0.000000 | 0.700000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| 25% | NaN | 4.009190e+05 | NaN | NaN | 36.000000 | NaN | NaN | 0.156290 | 0.134000 | 0.124200 | 0.115670 | 0.042400 | 0.074080 | 3.000000 | NaN | 4.000000 | 1.000000 | NaN | NaN | NaN | 26.000000 | NaN | NaN | NaN | NaN | 660.000000 | 679.000000 | NaN | 7.000000 | 6.000000 | 17.000000 | 4.00000 | 114.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.121000e+03 | 0.310000 | 880.000000 | 15.000000 | 0.820000 | 0.000000 | 0.140000 | NaN | NaN | 3.200333e+03 | NaN | 1.0000 | 9.000000 | 9.000000 | 0.000000 | 0.00000 | 3500.000000 | 0.000000 | -35.000000 | 0.000000 | 9.000000 | 6.000000 | 37332.000000 | 4000.00000 | NaN | NaN | NaN | 131.620000 | 1005.760000 | 500.890000 | 274.870000 | -73.180000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.00000 | 0.000000 | 2.000000 |
| 50% | NaN | 6.005540e+05 | NaN | NaN | 36.000000 | NaN | NaN | 0.209760 | 0.184000 | 0.173000 | 0.161500 | 0.072400 | 0.091700 | 4.000000 | NaN | 6.000000 | 1.000000 | NaN | NaN | NaN | 67.000000 | NaN | NaN | NaN | NaN | 680.000000 | 699.000000 | NaN | 10.000000 | 9.000000 | 25.000000 | 6.00000 | 271.000000 | 1.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.549000e+03 | 0.600000 | 4100.000000 | 22.000000 | 0.940000 | 0.000000 | 0.220000 | NaN | NaN | 4.666667e+03 | NaN | 1.0000 | 16.000000 | 15.000000 | 0.000000 | 0.00000 | 6000.000000 | 1626.550000 | -3.000000 | 0.000000 | 14.000000 | 21.000000 | 68599.000000 | 6500.00000 | NaN | NaN | NaN | 217.740000 | 2583.830000 | 1587.500000 | 700.840100 | -34.440000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.00000 | 0.000000 | 44.000000 |
| 75% | NaN | 8.926340e+05 | NaN | NaN | 36.000000 | NaN | NaN | 0.283810 | 0.250000 | 0.240000 | 0.224300 | 0.112000 | 0.116600 | 5.000000 | NaN | 8.000000 | 3.000000 | NaN | NaN | NaN | 137.000000 | NaN | NaN | NaN | NaN | 720.000000 | 739.000000 | NaN | 13.000000 | 12.000000 | 35.000000 | 9.00000 | 525.000000 | 2.000000 | 7.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.952100e+04 | 0.840000 | 13180.000000 | 30.000000 | 1.000000 | 1.000000 | 0.320000 | NaN | NaN | 6.825000e+03 | NaN | 2.0000 | 33.000000 | 32.000000 | 0.000000 | 0.00000 | 11000.000000 | 4126.720000 | 25.000000 | 0.000000 | 22.000000 | 65.000000 | 101901.000000 | 12000.00000 | NaN | NaN | NaN | 371.580000 | 5548.400000 | 4000.000000 | 1458.540000 | -13.920000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.00000 | 0.000000 | 115.000000 |
| max | NaN | 1.255725e+06 | NaN | NaN | 60.000000 | NaN | NaN | 0.512290 | 0.497500 | 0.492500 | 0.319900 | 0.366000 | 0.283700 | 7.000000 | NaN | 11.000000 | 20.000000 | NaN | NaN | NaN | 755.000000 | NaN | NaN | NaN | NaN | 880.000000 | 899.000000 | NaN | 59.000000 | 54.000000 | 136.000000 | 51.00000 | 14985.000000 | 105.000000 | 379.000000 | 83.000000 | 463881.000000 | 99.000000 | 38.000000 | 20.000000 | 1.435667e+06 | 5.950000 | 646285.000000 | 126.000000 | 1.000000 | 20.000000 | 10.010000 | NaN | NaN | 1.750003e+06 | NaN | 8.0000 | 141.000000 | 141.000000 | 42.000000 | 21.00000 | 72499.000000 | 23450.950000 | 286.000000 | 2704.000000 | 44.000000 | 100.000000 | 136486.000000 | 35000.00000 | NaN | NaN | NaN | 2251.510000 | 40702.390000 | 35000.000000 | 15617.030000 | 32.060000 | 0.000000 | 25000.000000 | 25000.000000 | 21117.900000 | 1.012500 | 39.000000 | 33.00000 | 25000.000000 | 1189.000000 |
df_description = pd.read_csv('Prosper Loan Data - Variable Definitions - Sheet1.csv')
#get description of all our columns in dataset
#source : https://stackoverflow.com/questions/33181846/programmatically-convert-pandas-dataframe-to-markdown-table
def pandas_df_to_markdown_table(df):
from IPython.display import Markdown, display
fmt = ['---' for i in range(len(df.columns))]
df_fmt = pd.DataFrame([fmt], columns=df.columns)
df_formatted = pd.concat([df_fmt, df])
display(Markdown(df_formatted.to_csv(sep="|", index=False)))
pandas_df_to_markdown_table(df_description)
| Variable | Description |
|---|---|
| ListingKey | Unique key for each listing, same value as the 'key' used in the listing object in the API. |
| ListingNumber | The number that uniquely identifies the listing to the public as displayed on the website. |
| ListingCreationDate | The date the listing was created. |
| CreditGrade | The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings. |
| Term | The length of the loan expressed in months. |
| LoanStatus | The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket. |
| ClosedDate | Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses. |
| BorrowerAPR | The Borrower's Annual Percentage Rate (APR) for the loan. |
| BorrowerRate | The Borrower's interest rate for this loan. |
| LenderYield | The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee. |
| EstimatedEffectiveYield | Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009. |
| EstimatedLoss | Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009. |
| EstimatedReturn | The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009. |
| ProsperRating (numeric) | The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009. |
| ProsperRating (Alpha) | The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009. |
| ProsperScore | A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. |
| ListingCategory | The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans |
| BorrowerState | The two letter abbreviation of the state of the address of the borrower at the time the Listing was created. |
| Occupation | The Occupation selected by the Borrower at the time they created the listing. |
| EmploymentStatus | The employment status of the borrower at the time they posted the listing. |
| EmploymentStatusDuration | The length in months of the employment status at the time the listing was created. |
| IsBorrowerHomeowner | A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. |
| CurrentlyInGroup | Specifies whether or not the Borrower was in a group at the time the listing was created. |
| GroupKey | The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation. |
| DateCreditPulled | The date the credit profile was pulled. |
| CreditScoreRangeLower | The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency. |
| CreditScoreRangeUpper | The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency. |
| FirstRecordedCreditLine | The date the first credit line was opened. |
| CurrentCreditLines | Number of current credit lines at the time the credit profile was pulled. |
| OpenCreditLines | Number of open credit lines at the time the credit profile was pulled. |
| TotalCreditLinespast7years | Number of credit lines in the past seven years at the time the credit profile was pulled. |
| OpenRevolvingAccounts | Number of open revolving accounts at the time the credit profile was pulled. |
| OpenRevolvingMonthlyPayment | Monthly payment on revolving accounts at the time the credit profile was pulled. |
| InquiriesLast6Months | Number of inquiries in the past six months at the time the credit profile was pulled. |
| TotalInquiries | Total number of inquiries at the time the credit profile was pulled. |
| CurrentDelinquencies | Number of accounts delinquent at the time the credit profile was pulled. |
| AmountDelinquent | Dollars delinquent at the time the credit profile was pulled. |
| DelinquenciesLast7Years | Number of delinquencies in the past 7 years at the time the credit profile was pulled. |
| PublicRecordsLast10Years | Number of public records in the past 10 years at the time the credit profile was pulled. |
| PublicRecordsLast12Months | Number of public records in the past 12 months at the time the credit profile was pulled. |
| RevolvingCreditBalance | Dollars of revolving credit at the time the credit profile was pulled. |
| BankcardUtilization | The percentage of available revolving credit that is utilized at the time the credit profile was pulled. |
| AvailableBankcardCredit | The total available credit via bank card at the time the credit profile was pulled. |
| TotalTrades | Number of trade lines ever opened at the time the credit profile was pulled. |
| TradesNeverDelinquent | Number of trades that have never been delinquent at the time the credit profile was pulled. |
| TradesOpenedLast6Months | Number of trades opened in the last 6 months at the time the credit profile was pulled. |
| DebtToIncomeRatio | The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%). |
| IncomeRange | The income range of the borrower at the time the listing was created. |
| IncomeVerifiable | The borrower indicated they have the required documentation to support their income. |
| StatedMonthlyIncome | The monthly income the borrower stated at the time the listing was created. |
| LoanKey | Unique key for each loan. This is the same key that is used in the API. |
| TotalProsperLoans | Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans. |
| TotalProsperPaymentsBilled | Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans. |
| OnTimeProsperPayments | Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans. |
| ProsperPaymentsLessThanOneMonthLate | Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. |
| ProsperPaymentsOneMonthPlusLate | Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. |
| ProsperPrincipalBorrowed | Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. |
| ProsperPrincipalOutstanding | Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. |
| ScorexChangeAtTimeOfListing | Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans. |
| LoanCurrentDaysDelinquent | The number of days delinquent. |
| LoanFirstDefaultedCycleNumber | The cycle the loan was charged off. If the loan has not charged off the value will be null. |
| LoanMonthsSinceOrigination | Number of months since the loan originated. |
| LoanNumber | Unique numeric value associated with the loan. |
| LoanOriginalAmount | The origination amount of the loan. |
| LoanOriginationDate | The date the loan was originated. |
| LoanOriginationQuarter | The quarter in which the loan was originated. |
| MemberKey | The unique key that is associated with the borrower. This is the same identifier that is used in the API member object. |
| MonthlyLoanPayment | The scheduled monthly loan payment. |
| LP_CustomerPayments | Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. |
| LP_CustomerPrincipalPayments | Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. |
| LP_InterestandFees | Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries. |
| LP_ServiceFees | Cumulative service fees paid by the investors who have invested in the loan. |
| LP_CollectionFees | Cumulative collection fees paid by the investors who have invested in the loan. |
| LP_GrossPrincipalLoss | The gross charged off amount of the loan. |
| LP_NetPrincipalLoss | The principal that remains uncollected after any recoveries. |
| LP_NonPrincipalRecoverypayments | The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal. |
| PercentFunded | Percent the listing was funded. |
| Recommendations | Number of recommendations the borrower had at the time the listing was created. |
| InvestmentFromFriendsCount | Number of friends that made an investment in the loan. |
| InvestmentFromFriendsAmount | Dollar amount of investments that were made by friends. |
| Investors | The number of investors that funded the loan. |
print(f'The number of entries are {len(prosper_df)} and no. of unique listing are {prosper_df.ListingKey.nunique()}')
The number of entries are 113937 and no. of unique listing are 113066
From above, we can see that we have duplicate listings we will need check on it.
prosper_df[prosper_df.ListingKey.duplicated() == True].count()['ListingKey']
871
prosper_df[prosper_df.ListingKey.duplicated() == True]['ListingKey'].head()
9 0F043596202561788EA13D5 999 0A0635972629771021E38F3 2539 2D2635984503681189056B4 4942 4B7E3590031274113F1FD34 5812 94B035909835592957503E6 Name: ListingKey, dtype: object
prosper_df[prosper_df.ListingKey == '0A0635972629771021E38F3']
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 998 | 0A0635972629771021E38F3 | 1055932 | 2013-12-18 14:25:17.053000000 | NaN | 60 | Current | NaN | 0.1771 | 0.1535 | 0.1435 | 0.13507 | 0.0599 | 0.07517 | 5.0 | B | 6.0 | 1 | MD | Other | Employed | 100.0 | True | False | NaN | 2014-02-08 11:39:13 | 760.0 | 779.0 | 1994-11-01 00:00:00 | 16.0 | 16.0 | 34.0 | 15 | 623.0 | 1.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8028.0 | 0.27 | 14190.0 | 32.0 | 0.96 | 1.0 | 0.35 | $25,000-49,999 | True | 3750.0 | CB643706582969191E8C26B | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 1 | 132043 | 10000 | 2014-02-12 00:00:00 | Q1 2014 | 98A937006186534818378B9 | 239.74 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 999 | 0A0635972629771021E38F3 | 1055932 | 2013-12-18 14:25:17.053000000 | NaN | 60 | Current | NaN | 0.1771 | 0.1535 | 0.1435 | 0.13507 | 0.0599 | 0.07517 | 5.0 | B | 7.0 | 1 | MD | Other | Employed | 100.0 | True | False | NaN | 2014-02-08 11:39:13 | 760.0 | 779.0 | 1994-11-01 00:00:00 | 16.0 | 16.0 | 34.0 | 15 | 623.0 | 1.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8028.0 | 0.27 | 14190.0 | 32.0 | 0.96 | 1.0 | 0.35 | $25,000-49,999 | True | 3750.0 | CB643706582969191E8C26B | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 1 | 132043 | 10000 | 2014-02-12 00:00:00 | Q1 2014 | 98A937006186534818378B9 | 239.74 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 57494 | 0A0635972629771021E38F3 | 1055932 | 2013-12-18 14:25:17.053000000 | NaN | 60 | Current | NaN | 0.1771 | 0.1535 | 0.1435 | 0.13507 | 0.0599 | 0.07517 | 5.0 | B | 8.0 | 1 | MD | Other | Employed | 100.0 | True | False | NaN | 2014-02-08 11:39:13 | 760.0 | 779.0 | 1994-11-01 00:00:00 | 16.0 | 16.0 | 34.0 | 15 | 623.0 | 1.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8028.0 | 0.27 | 14190.0 | 32.0 | 0.96 | 1.0 | 0.35 | $25,000-49,999 | True | 3750.0 | CB643706582969191E8C26B | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 1 | 132043 | 10000 | 2014-02-12 00:00:00 | Q1 2014 | 98A937006186534818378B9 | 239.74 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
prosper_df[prosper_df.ListingKey == '2D2635984503681189056B4']
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1554 | 2D2635984503681189056B4 | 1097584 | 2014-01-11 07:50:08.037000000 | NaN | 60 | Current | NaN | 0.14528 | 0.1224 | 0.1124 | 0.10731 | 0.0424 | 0.06491 | 5.0 | B | 9.0 | 1 | NJ | Police Officer/Correction Officer | Employed | 107.0 | True | False | NaN | 2014-03-07 10:04:00 | 720.0 | 739.0 | 1996-07-01 00:00:00 | 19.0 | 18.0 | 27.0 | 15 | 1024.0 | 1.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20802.0 | 0.4 | 26503.0 | 26.0 | 1.0 | 1.0 | 0.18 | $100,000+ | True | 10833.333333 | DC303708663304668E74587 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 135375 | 35000 | 2014-03-11 00:00:00 | Q1 2014 | 07B33702526608187DB957A | 782.81 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2539 | 2D2635984503681189056B4 | 1097584 | 2014-01-11 07:50:08.037000000 | NaN | 60 | Current | NaN | 0.14528 | 0.1224 | 0.1124 | 0.10731 | 0.0424 | 0.06491 | 5.0 | B | 10.0 | 1 | NJ | Police Officer/Correction Officer | Employed | 107.0 | True | False | NaN | 2014-03-07 10:04:00 | 720.0 | 739.0 | 1996-07-01 00:00:00 | 19.0 | 18.0 | 27.0 | 15 | 1024.0 | 1.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20802.0 | 0.4 | 26503.0 | 26.0 | 1.0 | 1.0 | 0.18 | $100,000+ | True | 10833.333333 | DC303708663304668E74587 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 135375 | 35000 | 2014-03-11 00:00:00 | Q1 2014 | 07B33702526608187DB957A | 782.81 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
It seems like we get duplicate listingKeys because duplicate rows have different ProsperScore. We will drop those rows with duplicate keys since they are not many and won't intefere much with our analysis.
prosper_df.drop_duplicates(subset = 'ListingKey', inplace = True)
prosper_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 113066 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113066 non-null object 1 ListingNumber 113066 non-null int64 2 ListingCreationDate 113066 non-null object 3 CreditGrade 28953 non-null object 4 Term 113066 non-null int64 5 LoanStatus 113066 non-null object 6 ClosedDate 55076 non-null object 7 BorrowerAPR 113041 non-null float64 8 BorrowerRate 113066 non-null float64 9 LenderYield 113066 non-null float64 10 EstimatedEffectiveYield 83982 non-null float64 11 EstimatedLoss 83982 non-null float64 12 EstimatedReturn 83982 non-null float64 13 ProsperRating (numeric) 83982 non-null float64 14 ProsperRating (Alpha) 83982 non-null object 15 ProsperScore 83982 non-null float64 16 ListingCategory (numeric) 113066 non-null int64 17 BorrowerState 107551 non-null object 18 Occupation 109537 non-null object 19 EmploymentStatus 110811 non-null object 20 EmploymentStatusDuration 105441 non-null float64 21 IsBorrowerHomeowner 113066 non-null bool 22 CurrentlyInGroup 113066 non-null bool 23 GroupKey 13339 non-null object 24 DateCreditPulled 113066 non-null object 25 CreditScoreRangeLower 112475 non-null float64 26 CreditScoreRangeUpper 112475 non-null float64 27 FirstRecordedCreditLine 112369 non-null object 28 CurrentCreditLines 105462 non-null float64 29 OpenCreditLines 105462 non-null float64 30 TotalCreditLinespast7years 112369 non-null float64 31 OpenRevolvingAccounts 113066 non-null int64 32 OpenRevolvingMonthlyPayment 113066 non-null float64 33 InquiriesLast6Months 112369 non-null float64 34 TotalInquiries 111907 non-null float64 35 CurrentDelinquencies 112369 non-null float64 36 AmountDelinquent 105444 non-null float64 37 DelinquenciesLast7Years 112076 non-null float64 38 PublicRecordsLast10Years 112369 non-null float64 39 PublicRecordsLast12Months 105462 non-null float64 40 RevolvingCreditBalance 105462 non-null float64 41 BankcardUtilization 105462 non-null float64 42 AvailableBankcardCredit 105522 non-null float64 43 TotalTrades 105522 non-null float64 44 TradesNeverDelinquent (percentage) 105522 non-null float64 45 TradesOpenedLast6Months 105522 non-null float64 46 DebtToIncomeRatio 104594 non-null float64 47 IncomeRange 113066 non-null object 48 IncomeVerifiable 113066 non-null bool 49 StatedMonthlyIncome 113066 non-null float64 50 LoanKey 113066 non-null object 51 TotalProsperLoans 21923 non-null float64 52 TotalProsperPaymentsBilled 21923 non-null float64 53 OnTimeProsperPayments 21923 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 21923 non-null float64 55 ProsperPaymentsOneMonthPlusLate 21923 non-null float64 56 ProsperPrincipalBorrowed 21923 non-null float64 57 ProsperPrincipalOutstanding 21923 non-null float64 58 ScorexChangeAtTimeOfListing 18912 non-null float64 59 LoanCurrentDaysDelinquent 113066 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113066 non-null int64 62 LoanNumber 113066 non-null int64 63 LoanOriginalAmount 113066 non-null int64 64 LoanOriginationDate 113066 non-null object 65 LoanOriginationQuarter 113066 non-null object 66 MemberKey 113066 non-null object 67 MonthlyLoanPayment 113066 non-null float64 68 LP_CustomerPayments 113066 non-null float64 69 LP_CustomerPrincipalPayments 113066 non-null float64 70 LP_InterestandFees 113066 non-null float64 71 LP_ServiceFees 113066 non-null float64 72 LP_CollectionFees 113066 non-null float64 73 LP_GrossPrincipalLoss 113066 non-null float64 74 LP_NetPrincipalLoss 113066 non-null float64 75 LP_NonPrincipalRecoverypayments 113066 non-null float64 76 PercentFunded 113066 non-null float64 77 Recommendations 113066 non-null int64 78 InvestmentFromFriendsCount 113066 non-null int64 79 InvestmentFromFriendsAmount 113066 non-null float64 80 Investors 113066 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.5+ MB
I will use Pandas profiling to understand each variable individually and select the best variables for understanding what factors affect orrower's Annual Percentage Rate (APR) for the loan.
#profile = ProfileReport(prosper_df, title="Prosper Data Profiling Report", minimal=True)
#profile.to_file("Prosper Data Profiling Report.html")
#profile
There are many learnings from above report such as distributions, column descriptions, warnings such as high number of zeros in a columns, columns which are wrong datatype such as ListingCreationDate which we will clean later. After much deliberation, I have selected 16 columns which I feel will give us relevant learnings.
# this variable subset will be a starting point for further cleaning and investigation of relevant data
prosper_reduced_data = prosper_df[['ListingKey', 'ListingCreationDate', 'ClosedDate', 'LoanStatus', 'Term', 'LoanOriginalAmount', 'MonthlyLoanPayment',\
'ListingCategory (numeric)', 'BorrowerAPR', 'BorrowerRate', \
'CreditGrade', 'ProsperRating (Alpha)', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',\
'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'DebtToIncomeRatio', 'EmploymentStatus',\
'EmploymentStatusDuration','IsBorrowerHomeowner']]
prosper_reduced_data.head()
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory (numeric) | BorrowerAPR | BorrowerRate | CreditGrade | ProsperRating (Alpha) | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263000000 | 2009-08-14 00:00:00 | Completed | 36 | 9425 | 330.43 | 0 | 0.16516 | 0.1580 | C | NaN | 640.0 | 659.0 | $25,000-49,999 | True | 3083.333333 | 0.17 | Self-employed | 2.0 | True |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900000000 | NaN | Current | 36 | 10000 | 318.93 | 2 | 0.12016 | 0.0920 | NaN | A | 680.0 | 699.0 | $50,000-74,999 | True | 6125.000000 | 0.18 | Employed | 44.0 | False |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090000000 | 2009-12-17 00:00:00 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | HR | NaN | 480.0 | 499.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010000000 | NaN | Current | 36 | 10000 | 321.45 | 16 | 0.12528 | 0.0974 | NaN | A | 800.0 | 819.0 | $25,000-49,999 | True | 2875.000000 | 0.15 | Employed | 113.0 | True |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097000000 | NaN | Current | 36 | 15000 | 563.97 | 2 | 0.24614 | 0.2085 | NaN | D | 680.0 | 699.0 | $100,000+ | True | 9583.333333 | 0.26 | Employed | 44.0 | True |
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 113066 entries, 0 to 113936 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113066 non-null object 1 ListingCreationDate 113066 non-null object 2 ClosedDate 55076 non-null object 3 LoanStatus 113066 non-null object 4 Term 113066 non-null int64 5 LoanOriginalAmount 113066 non-null int64 6 MonthlyLoanPayment 113066 non-null float64 7 ListingCategory (numeric) 113066 non-null int64 8 BorrowerAPR 113041 non-null float64 9 BorrowerRate 113066 non-null float64 10 CreditGrade 28953 non-null object 11 ProsperRating (Alpha) 83982 non-null object 12 CreditScoreRangeLower 112475 non-null float64 13 CreditScoreRangeUpper 112475 non-null float64 14 IncomeRange 113066 non-null object 15 IncomeVerifiable 113066 non-null bool 16 StatedMonthlyIncome 113066 non-null float64 17 DebtToIncomeRatio 104594 non-null float64 18 EmploymentStatus 110811 non-null object 19 EmploymentStatusDuration 105441 non-null float64 20 IsBorrowerHomeowner 113066 non-null bool dtypes: bool(2), float64(8), int64(3), object(8) memory usage: 17.5+ MB
prosper_reduced_data.describe(include= 'all')
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory (numeric) | BorrowerAPR | BorrowerRate | CreditGrade | ProsperRating (Alpha) | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 113066 | 113066 | 55076 | 113066 | 113066.00000 | 113066.000000 | 113066.000000 | 113066.000000 | 113041.00000 | 113066.000000 | 28953 | 83982 | 112475.000000 | 112475.000000 | 113066 | 113066 | 1.130660e+05 | 104594.000000 | 110811 | 105441.000000 | 113066 |
| unique | 113066 | 113064 | 2802 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | 8 | 7 | NaN | NaN | 8 | 2 | NaN | NaN | 8 | NaN | 2 |
| top | 1021339766868145413AB3B | 2013-06-03 17:27:50.540000000 | 2014-03-04 00:00:00 | Current | NaN | NaN | NaN | NaN | NaN | NaN | C | C | NaN | NaN | $25,000-49,999 | True | NaN | NaN | Employed | NaN | True |
| freq | 1 | 2 | 105 | 55730 | NaN | NaN | NaN | NaN | NaN | NaN | 5649 | 18096 | NaN | NaN | 31940 | 104479 | NaN | NaN | 66598 | NaN | 57052 |
| mean | NaN | NaN | NaN | NaN | 40.80017 | 8314.762307 | 271.932742 | 2.776838 | 0.21898 | 0.192946 | NaN | NaN | 685.524961 | 704.524961 | NaN | NaN | 5.605120e+03 | 0.276032 | NaN | 96.060584 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| min | NaN | NaN | NaN | NaN | 12.00000 | 1000.000000 | 0.000000 | 0.000000 | 0.00653 | 0.000000 | NaN | NaN | 0.000000 | 19.000000 | NaN | NaN | 0.000000e+00 | 0.000000 | NaN | 0.000000 | NaN |
| 25% | NaN | NaN | NaN | NaN | 36.00000 | 4000.000000 | 130.950000 | 1.000000 | 0.15629 | 0.134000 | NaN | NaN | 660.000000 | 679.000000 | NaN | NaN | 3.199396e+03 | 0.140000 | NaN | 26.000000 | NaN |
| 50% | NaN | NaN | NaN | NaN | 36.00000 | 6300.000000 | 217.370000 | 1.000000 | 0.20984 | 0.184000 | NaN | NaN | 680.000000 | 699.000000 | NaN | NaN | 4.666667e+03 | 0.220000 | NaN | 67.000000 | NaN |
| 75% | NaN | NaN | NaN | NaN | 36.00000 | 12000.000000 | 370.570000 | 3.000000 | 0.28386 | 0.250600 | NaN | NaN | 720.000000 | 739.000000 | NaN | NaN | 6.824688e+03 | 0.320000 | NaN | 137.000000 | NaN |
| max | NaN | NaN | NaN | NaN | 60.00000 | 35000.000000 | 2251.510000 | 20.000000 | 0.51229 | 0.497500 | NaN | NaN | 880.000000 | 899.000000 | NaN | NaN | 1.750003e+06 | 10.010000 | NaN | 755.000000 | NaN |
#profile = ProfileReport(prosper_reduced_data, title="Prosper Data Profiling Report", explorative=True)
#profile.to_file("Reduced Prosper Data Profiling Report.html")
#profile
#Convert Listing Creation date and Closed Date to datetime
prosper_reduced_data.ListingCreationDate = pd.to_datetime(prosper_reduced_data.ListingCreationDate, errors='coerce')
prosper_reduced_data.ClosedDate = pd.to_datetime(prosper_reduced_data.ClosedDate, errors='coerce')
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 113066 entries, 0 to 113936 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113066 non-null object 1 ListingCreationDate 113066 non-null datetime64[ns] 2 ClosedDate 55076 non-null datetime64[ns] 3 LoanStatus 113066 non-null object 4 Term 113066 non-null int64 5 LoanOriginalAmount 113066 non-null int64 6 MonthlyLoanPayment 113066 non-null float64 7 ListingCategory (numeric) 113066 non-null int64 8 BorrowerAPR 113041 non-null float64 9 BorrowerRate 113066 non-null float64 10 CreditGrade 28953 non-null object 11 ProsperRating (Alpha) 83982 non-null object 12 CreditScoreRangeLower 112475 non-null float64 13 CreditScoreRangeUpper 112475 non-null float64 14 IncomeRange 113066 non-null object 15 IncomeVerifiable 113066 non-null bool 16 StatedMonthlyIncome 113066 non-null float64 17 DebtToIncomeRatio 104594 non-null float64 18 EmploymentStatus 110811 non-null object 19 EmploymentStatusDuration 105441 non-null float64 20 IsBorrowerHomeowner 113066 non-null bool dtypes: bool(2), datetime64[ns](2), float64(8), int64(3), object(6) memory usage: 17.5+ MB
#change column names with spaces
prosper_reduced_data.rename(columns={'ListingCategory (numeric)': 'ListingCategory', 'ProsperRating (Alpha)': 'ProsperRating'}, inplace=True)
prosper_reduced_data.head()
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditGrade | ProsperRating | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263 | 2009-08-14 | Completed | 36 | 9425 | 330.43 | 0 | 0.16516 | 0.1580 | C | NaN | 640.0 | 659.0 | $25,000-49,999 | True | 3083.333333 | 0.17 | Self-employed | 2.0 | True |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900 | NaT | Current | 36 | 10000 | 318.93 | 2 | 0.12016 | 0.0920 | NaN | A | 680.0 | 699.0 | $50,000-74,999 | True | 6125.000000 | 0.18 | Employed | 44.0 | False |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090 | 2009-12-17 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | HR | NaN | 480.0 | 499.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010 | NaT | Current | 36 | 10000 | 321.45 | 16 | 0.12528 | 0.0974 | NaN | A | 800.0 | 819.0 | $25,000-49,999 | True | 2875.000000 | 0.15 | Employed | 113.0 | True |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097 | NaT | Current | 36 | 15000 | 563.97 | 2 | 0.24614 | 0.2085 | NaN | D | 680.0 | 699.0 | $100,000+ | True | 9583.333333 | 0.26 | Employed | 44.0 | True |
#checking on missing values in Employment status
prosper_reduced_data.EmploymentStatus.value_counts()
Employed 66598 Full-time 26354 Self-employed 6052 Not available 5347 Other 3742 Part-time 1088 Not employed 835 Retired 795 Name: EmploymentStatus, dtype: int64
prosper_reduced_data[prosper_reduced_data.EmploymentStatus.isna()].sort_values(by = 'ListingCreationDate')
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditGrade | ProsperRating | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113921 | D21A33647075665665BE266 | 2005-11-09 20:44:28.847 | 2005-11-25 | Completed | 36 | 1000 | 29.52 | 0 | NaN | 0.0400 | AA | NaN | NaN | NaN | Not displayed | True | 12500.000000 | 0.23284 | NaN | NaN | False |
| 106735 | 97B333660731451030726AF | 2005-11-11 08:46:17.303 | 2006-02-07 | Completed | 36 | 1500 | 45.60 | 0 | NaN | 0.0595 | AA | NaN | NaN | NaN | Not displayed | True | 14583.333333 | 0.01747 | NaN | NaN | False |
| 25166 | B3083364738592238DBDA63 | 2005-11-15 16:06:47.457 | 2005-11-29 | Completed | 36 | 1000 | 29.68 | 0 | NaN | 0.0435 | AA | NaN | NaN | NaN | Not displayed | True | 9166.666667 | 0.04120 | NaN | NaN | False |
| 62297 | 2F25336514614362295DA03 | 2005-11-15 18:25:56.107 | 2005-11-30 | Completed | 36 | 1000 | 35.65 | 0 | NaN | 0.1700 | E | NaN | NaN | NaN | Not displayed | True | 10000.000000 | 0.01783 | NaN | NaN | False |
| 90598 | B87F3364623805558A45EAB | 2005-11-18 13:39:03.263 | 2005-11-29 | Completed | 36 | 1000 | 33.21 | 0 | NaN | 0.1200 | B | NaN | NaN | NaN | Not displayed | True | 16666.666667 | 0.01051 | NaN | NaN | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57106 | 21F83380195431266D92121 | 2007-02-05 20:54:50.233 | 2007-12-10 | Completed | 36 | 5000 | 170.89 | 0 | 0.14709 | 0.1400 | D | NaN | 620.0 | 639.0 | Not displayed | True | 5416.666667 | 0.07000 | NaN | NaN | False |
| 13239 | B1A13379859311513B13011 | 2007-02-07 11:53:37.327 | 2010-01-20 | Completed | 36 | 2000 | 61.75 | 0 | 0.07852 | 0.0700 | B | NaN | 700.0 | 719.0 | Not displayed | True | 2333.333333 | 0.26000 | NaN | NaN | False |
| 22979 | 57AB338020191550932BC4A | 2007-02-07 12:03:49.340 | 2010-02-02 | Completed | 36 | 2000 | 64.42 | 0 | 0.15239 | 0.1435 | D | NaN | 600.0 | 619.0 | Not displayed | True | 1916.666667 | 0.04000 | NaN | NaN | False |
| 66469 | 6F793380667470304C450E0 | 2007-02-07 21:07:47.410 | 2007-09-04 | Completed | 36 | 4000 | 140.41 | 0 | 0.16607 | 0.1589 | D | NaN | 600.0 | 619.0 | Not displayed | True | 2500.000000 | 0.09000 | NaN | NaN | False |
| 7312 | 00B1338100511182555E371 | 2007-02-09 08:00:09.510 | 2007-05-16 | Completed | 36 | 5000 | 159.81 | 0 | 0.10039 | 0.0935 | A | NaN | 740.0 | 759.0 | Not displayed | True | 5250.000000 | 0.08000 | NaN | NaN | True |
df = prosper_reduced_data[prosper_reduced_data.EmploymentStatus.isna()].sort_values(by = 'ListingCreationDate')
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2255 entries, 113921 to 7312 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 2255 non-null object 1 ListingCreationDate 2255 non-null datetime64[ns] 2 ClosedDate 2255 non-null datetime64[ns] 3 LoanStatus 2255 non-null object 4 Term 2255 non-null int64 5 LoanOriginalAmount 2255 non-null int64 6 MonthlyLoanPayment 2255 non-null float64 7 ListingCategory 2255 non-null int64 8 BorrowerAPR 2230 non-null float64 9 BorrowerRate 2255 non-null float64 10 CreditGrade 2255 non-null object 11 ProsperRating 0 non-null object 12 CreditScoreRangeLower 1666 non-null float64 13 CreditScoreRangeUpper 1666 non-null float64 14 IncomeRange 2255 non-null object 15 IncomeVerifiable 2255 non-null bool 16 StatedMonthlyIncome 2255 non-null float64 17 DebtToIncomeRatio 2226 non-null float64 18 EmploymentStatus 0 non-null object 19 EmploymentStatusDuration 0 non-null float64 20 IsBorrowerHomeowner 2255 non-null bool dtypes: bool(2), datetime64[ns](2), float64(8), int64(3), object(6) memory usage: 356.7+ KB
df.LoanStatus.value_counts()
Completed 1439 Defaulted 522 Chargedoff 291 Cancelled 3 Name: LoanStatus, dtype: int64
#get no. of listings without Employment Status
len(prosper_reduced_data[prosper_reduced_data.EmploymentStatus.isna()])
2255
We have 2255 Records missing employment status, they all seem to data recorded between 2005 to 2007 and they are closed. Due to the small no. of accounts compared to entire df, I will remove those accounts from our analysis
prosper_reduced_data = prosper_reduced_data[prosper_reduced_data.EmploymentStatus.notna()]
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 110811 entries, 0 to 113936 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 110811 non-null object 1 ListingCreationDate 110811 non-null datetime64[ns] 2 ClosedDate 52821 non-null datetime64[ns] 3 LoanStatus 110811 non-null object 4 Term 110811 non-null int64 5 LoanOriginalAmount 110811 non-null int64 6 MonthlyLoanPayment 110811 non-null float64 7 ListingCategory 110811 non-null int64 8 BorrowerAPR 110811 non-null float64 9 BorrowerRate 110811 non-null float64 10 CreditGrade 26698 non-null object 11 ProsperRating 83982 non-null object 12 CreditScoreRangeLower 110809 non-null float64 13 CreditScoreRangeUpper 110809 non-null float64 14 IncomeRange 110811 non-null object 15 IncomeVerifiable 110811 non-null bool 16 StatedMonthlyIncome 110811 non-null float64 17 DebtToIncomeRatio 102368 non-null float64 18 EmploymentStatus 110811 non-null object 19 EmploymentStatusDuration 105441 non-null float64 20 IsBorrowerHomeowner 110811 non-null bool dtypes: bool(2), datetime64[ns](2), float64(8), int64(3), object(6) memory usage: 17.1+ MB
#
prosper_reduced_data.IncomeRange.value_counts()
$25,000-49,999 31940 $50,000-74,999 30749 $100,000+ 17188 $75,000-99,999 16780 $1-24,999 7241 Not displayed 5486 Not employed 806 $0 621 Name: IncomeRange, dtype: int64
# reduce no. of IncomeRange categories by 1
prosper_reduced_data.IncomeRange.replace({'$0': 'Not employed'}, inplace=True)
#test above code
prosper_reduced_data.IncomeRange.value_counts()
$25,000-49,999 31940 $50,000-74,999 30749 $100,000+ 17188 $75,000-99,999 16780 $1-24,999 7241 Not displayed 5486 Not employed 1427 Name: IncomeRange, dtype: int64
Using Pandas Profiling report, CreditGrade and ProsperRating seems like they have the same column values. Let's deep dive into that and clean if necessary
prosper_reduced_data.CreditGrade.value_counts()
C 5282 D 4799 B 4149 AA 3296 A 3128 HR 3061 E 2884 NC 99 Name: CreditGrade, dtype: int64
prosper_reduced_data.ProsperRating.value_counts()
C 18096 B 15368 A 14390 D 14170 E 9716 HR 6917 AA 5325 Name: ProsperRating, dtype: int64
Checking relationship between Credit Grade and Prosper Rating. I will take a sample and see if there are similarities between the 2 Scoring systems
prosper_reduced_data[['ListingKey','CreditGrade', 'ProsperRating']].sample(30)
| ListingKey | CreditGrade | ProsperRating | |
|---|---|---|---|
| 3786 | 0E343581157529075FC619B | NaN | D |
| 28338 | BCD635286314316442233F1 | NaN | D |
| 40730 | 97413544990144310CB0DA0 | NaN | HR |
| 93957 | D51D351501389539177927A | NaN | E |
| 32152 | 6E693423836664647DDF678 | C | NaN |
| ... | ... | ... | ... |
| 21708 | F87C353164153037139EB8A | NaN | A |
| 30495 | 6D2B34200402236278C99FD | C | NaN |
| 2090 | 733335868973606601F6BB4 | NaN | B |
| 69454 | E1C33516664169197938BC5 | NaN | D |
| 38079 | 52F4337170122538069DC2C | E | NaN |
There seems to be situation where if creditGrade is missing the ProsperRaating is available. I will combine the two Scores and use the combined column as CreditRiskScore
prosper_reduced_data['CreditRiskScore'] = np.where(prosper_reduced_data.CreditGrade.isna(), prosper_reduced_data.ProsperRating, prosper_reduced_data.CreditGrade)
prosper_reduced_data.head(10)
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditGrade | ProsperRating | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263 | 2009-08-14 | Completed | 36 | 9425 | 330.43 | 0 | 0.16516 | 0.1580 | C | NaN | 640.0 | 659.0 | $25,000-49,999 | True | 3083.333333 | 0.17 | Self-employed | 2.0 | True | C |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900 | NaT | Current | 36 | 10000 | 318.93 | 2 | 0.12016 | 0.0920 | NaN | A | 680.0 | 699.0 | $50,000-74,999 | True | 6125.000000 | 0.18 | Employed | 44.0 | False | A |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090 | 2009-12-17 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | HR | NaN | 480.0 | 499.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False | HR |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010 | NaT | Current | 36 | 10000 | 321.45 | 16 | 0.12528 | 0.0974 | NaN | A | 800.0 | 819.0 | $25,000-49,999 | True | 2875.000000 | 0.15 | Employed | 113.0 | True | A |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097 | NaT | Current | 36 | 15000 | 563.97 | 2 | 0.24614 | 0.2085 | NaN | D | 680.0 | 699.0 | $100,000+ | True | 9583.333333 | 0.26 | Employed | 44.0 | True | D |
| 5 | 0F05359734824199381F61D | 2013-12-14 08:26:37.093 | NaT | Current | 60 | 15000 | 342.37 | 1 | 0.15425 | 0.1314 | NaN | B | 740.0 | 759.0 | $100,000+ | True | 8333.333333 | 0.36 | Employed | 82.0 | True | B |
| 6 | 0F0A3576754255009D63151 | 2013-04-12 09:52:56.147 | NaT | Current | 36 | 3000 | 122.67 | 1 | 0.31032 | 0.2712 | NaN | E | 680.0 | 699.0 | $25,000-49,999 | True | 2083.333333 | 0.27 | Employed | 172.0 | False | E |
| 7 | 0F1035772717087366F9EA7 | 2013-05-05 06:49:27.493 | NaT | Current | 36 | 10000 | 372.60 | 2 | 0.23939 | 0.2019 | NaN | C | 700.0 | 719.0 | $25,000-49,999 | True | 3355.750000 | 0.24 | Employed | 103.0 | False | C |
| 8 | 0F043596202561788EA13D5 | 2013-12-02 10:43:39.117 | NaT | Current | 36 | 10000 | 305.54 | 7 | 0.07620 | 0.0629 | NaN | AA | 820.0 | 839.0 | $25,000-49,999 | True | 3333.333333 | 0.25 | Employed | 269.0 | True | AA |
| 10 | 0F123545674891886D9F106 | 2012-05-10 07:04:01.577 | NaT | Current | 60 | 13500 | 395.37 | 1 | 0.27462 | 0.2489 | NaN | C | 640.0 | 659.0 | $75,000-99,999 | True | 7500.000000 | 0.12 | Employed | 300.0 | False | C |
prosper_reduced_data.CreditRiskScore.value_counts()
C 23378 B 19517 D 18969 A 17518 E 12600 HR 9978 AA 8621 NC 99 Name: CreditRiskScore, dtype: int64
# the other two ratings columns will not be needed anymore
prosper_reduced_data.drop(['CreditGrade', 'ProsperRating'], axis=1, inplace=True)
prosper_reduced_data
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263 | 2009-08-14 | Completed | 36 | 9425 | 330.43 | 0 | 0.16516 | 0.1580 | 640.0 | 659.0 | $25,000-49,999 | True | 3083.333333 | 0.17 | Self-employed | 2.0 | True | C |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900 | NaT | Current | 36 | 10000 | 318.93 | 2 | 0.12016 | 0.0920 | 680.0 | 699.0 | $50,000-74,999 | True | 6125.000000 | 0.18 | Employed | 44.0 | False | A |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090 | 2009-12-17 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | 480.0 | 499.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False | HR |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010 | NaT | Current | 36 | 10000 | 321.45 | 16 | 0.12528 | 0.0974 | 800.0 | 819.0 | $25,000-49,999 | True | 2875.000000 | 0.15 | Employed | 113.0 | True | A |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097 | NaT | Current | 36 | 15000 | 563.97 | 2 | 0.24614 | 0.2085 | 680.0 | 699.0 | $100,000+ | True | 9583.333333 | 0.26 | Employed | 44.0 | True | D |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 2013-04-14 05:55:02.663 | NaT | Current | 36 | 10000 | 364.74 | 1 | 0.22354 | 0.1864 | 700.0 | 719.0 | $50,000-74,999 | True | 4333.333333 | 0.13 | Employed | 246.0 | True | C |
| 113933 | E6DB353036033497292EE43 | 2011-11-03 20:42:55.333 | NaT | FinalPaymentInProgress | 36 | 2000 | 65.57 | 7 | 0.13220 | 0.1110 | 700.0 | 719.0 | $75,000-99,999 | True | 8041.666667 | 0.11 | Employed | 21.0 | True | A |
| 113934 | E6E13596170052029692BB1 | 2013-12-13 05:49:12.703 | NaT | Current | 60 | 10000 | 273.35 | 1 | 0.23984 | 0.2150 | 700.0 | 719.0 | $25,000-49,999 | True | 2875.000000 | 0.51 | Employed | 84.0 | True | D |
| 113935 | E6EB3531504622671970D9E | 2011-11-14 13:18:26.597 | 2013-08-13 | Completed | 60 | 15000 | 449.55 | 2 | 0.28408 | 0.2605 | 680.0 | 699.0 | $25,000-49,999 | True | 3875.000000 | 0.48 | Full-time | 94.0 | True | C |
| 113936 | E6ED3600409833199F711B7 | 2014-01-15 09:27:37.657 | NaT | Current | 36 | 2000 | 64.90 | 1 | 0.13189 | 0.1039 | 680.0 | 699.0 | $50,000-74,999 | True | 4583.333333 | 0.23 | Employed | 244.0 | False | A |
#Employment Duration
borrowers_without_employment_duration_df = prosper_reduced_data[prosper_reduced_data.EmploymentStatusDuration.isnull() == True]
borrowers_without_employment_duration_df
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090 | 2009-12-17 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | 480.0 | 499.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False | HR |
| 17 | 0FDF336575182398852A8DC | 2006-08-15 12:21:09.433 | 2008-05-22 | Chargedoff | 36 | 10000 | 334.54 | 0 | 0.13202 | 0.1250 | 760.0 | 779.0 | Not displayed | True | 5833.333333 | 0.12 | Not available | NaN | True | AA |
| 42 | 10AD33803102786533722BB | 2007-02-04 17:24:27.850 | 2008-08-08 | Chargedoff | 36 | 1500 | 62.86 | 0 | 0.30300 | 0.2900 | 520.0 | 539.0 | Not displayed | True | 416.666667 | 0.27 | Not available | NaN | False | HR |
| 63 | 0F1A336989050248528ED31 | 2006-10-07 00:32:48.020 | 2007-04-13 | Completed | 36 | 6000 | 251.43 | 0 | 0.29776 | 0.2900 | 620.0 | 639.0 | Not displayed | True | 7500.000000 | 0.25 | Not available | NaN | True | D |
| 76 | 0DFD336490760860963B515 | 2006-07-11 20:48:47.660 | 2007-08-10 | Completed | 36 | 2200 | 87.47 | 0 | 0.25861 | 0.2500 | 540.0 | 559.0 | Not displayed | True | 8583.333333 | 0.13 | Not available | NaN | True | E |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113797 | E467337109094381251D0BB | 2006-10-18 11:35:37.843 | 2008-03-30 | Defaulted | 36 | 4500 | 147.32 | 0 | 0.11696 | 0.1100 | 660.0 | 679.0 | Not displayed | True | 4166.666667 | 0.04 | Not available | NaN | False | C |
| 113819 | CF913368831369031FEEAB6 | 2006-09-14 13:19:21.320 | 2009-09-03 | Completed | 36 | 3000 | 107.56 | 0 | 0.18123 | 0.1740 | 660.0 | 679.0 | Not displayed | True | 4583.333333 | 0.13 | Not available | NaN | True | C |
| 113825 | CFB1338127898040944BD47 | 2007-02-05 11:42:34.790 | 2010-01-13 | Completed | 36 | 2550 | 106.86 | 0 | 0.29776 | 0.2900 | 520.0 | 539.0 | Not displayed | True | 8833.333333 | 0.08 | Not available | NaN | False | HR |
| 113848 | FFF0338012502972731B2C1 | 2007-02-07 15:04:40.563 | 2010-02-13 | Completed | 36 | 1000 | 0.00 | 0 | 0.30709 | 0.2875 | 0.0 | 19.0 | Not displayed | True | 833.333333 | NaN | Not available | NaN | False | NC |
| 113926 | E34333654295683286C35C2 | 2006-08-01 10:31:31.143 | 2007-09-06 | Defaulted | 36 | 3000 | 125.72 | 0 | 0.29776 | 0.2900 | 540.0 | 559.0 | Not displayed | True | 5416.666667 | 0.07 | Not available | NaN | False | E |
print(f'min Date without employment duration is {borrowers_without_employment_duration_df.ListingCreationDate.min()} and max date was {borrowers_without_employment_duration_df.ListingCreationDate.max()}')
min Date without employment duration is 2006-02-15 14:21:25.347000 and max date was 2013-09-16 14:47:16.037000
borrowers_without_employment_duration_df.EmploymentStatus.value_counts()
Not available 5347 Full-time 13 Other 6 Self-employed 2 Not employed 1 Employed 1 Name: EmploymentStatus, dtype: int64
#non-available borrowers without employment status in raw data before filtering
prosper_reduced_data[prosper_reduced_data.EmploymentStatus == 'Not available']['EmploymentStatusDuration'].head()
2 NaN 17 NaN 42 NaN 63 NaN 76 NaN Name: EmploymentStatusDuration, dtype: float64
#most of the borrowers didn't give employmentStatus when onboarded, I will let it as it is
# Convert CreditGrade,IncomeRange as ordered categorical vars
var_dict = {'CreditRiskScore': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC'],
'IncomeRange': ['Not displayed', 'Not employed', '$1-24,999',
'$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+']}
for var in var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories = var_dict[var])
prosper_reduced_data[var] = prosper_reduced_data[var].astype(ordered_var)
A data set from Prosper, which is America’s first marketplace lending platform, with over $7 billion in funded loans. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
What factors affect borrower's APR score.
The supporting features of interest are the loan characteristics on the one hand, such as the loan amount, the date when the loan was taken, the term of the loan, the status of the loan, and some borrower characteristics on the other hand, such as the income, the employment and housing situation and debt to income ration of the borrower.
prosper_reduced_data.head()
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditScoreRangeLower | CreditScoreRangeUpper | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263 | 2009-08-14 | Completed | 36 | 9425 | 330.43 | 0 | 0.16516 | 0.1580 | 640.0 | 659.0 | $25,000-49,999 | True | 3083.333333 | 0.17 | Self-employed | 2.0 | True | C |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900 | NaT | Current | 36 | 10000 | 318.93 | 2 | 0.12016 | 0.0920 | 680.0 | 699.0 | $50,000-74,999 | True | 6125.000000 | 0.18 | Employed | 44.0 | False | A |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090 | 2009-12-17 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | 480.0 | 499.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False | HR |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010 | NaT | Current | 36 | 10000 | 321.45 | 16 | 0.12528 | 0.0974 | 800.0 | 819.0 | $25,000-49,999 | True | 2875.000000 | 0.15 | Employed | 113.0 | True | A |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097 | NaT | Current | 36 | 15000 | 563.97 | 2 | 0.24614 | 0.2085 | 680.0 | 699.0 | $100,000+ | True | 9583.333333 | 0.26 | Employed | 44.0 | True | D |
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 110811 entries, 0 to 113936 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 110811 non-null object 1 ListingCreationDate 110811 non-null datetime64[ns] 2 ClosedDate 52821 non-null datetime64[ns] 3 LoanStatus 110811 non-null object 4 Term 110811 non-null int64 5 LoanOriginalAmount 110811 non-null int64 6 MonthlyLoanPayment 110811 non-null float64 7 ListingCategory 110811 non-null int64 8 BorrowerAPR 110811 non-null float64 9 BorrowerRate 110811 non-null float64 10 CreditScoreRangeLower 110809 non-null float64 11 CreditScoreRangeUpper 110809 non-null float64 12 IncomeRange 110811 non-null category 13 IncomeVerifiable 110811 non-null bool 14 StatedMonthlyIncome 110811 non-null float64 15 DebtToIncomeRatio 102368 non-null float64 16 EmploymentStatus 110811 non-null object 17 EmploymentStatusDuration 105441 non-null float64 18 IsBorrowerHomeowner 110811 non-null bool 19 CreditRiskScore 110680 non-null category dtypes: bool(2), category(2), datetime64[ns](2), float64(8), int64(3), object(3) memory usage: 14.8+ MB
Convert it from int to category
prosper_reduced_data['ListingCategory'] = prosper_reduced_data['ListingCategory'].astype('category')
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 110811 entries, 0 to 113936 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 110811 non-null object 1 ListingCreationDate 110811 non-null datetime64[ns] 2 ClosedDate 52821 non-null datetime64[ns] 3 LoanStatus 110811 non-null object 4 Term 110811 non-null int64 5 LoanOriginalAmount 110811 non-null int64 6 MonthlyLoanPayment 110811 non-null float64 7 ListingCategory 110811 non-null category 8 BorrowerAPR 110811 non-null float64 9 BorrowerRate 110811 non-null float64 10 CreditScoreRangeLower 110809 non-null float64 11 CreditScoreRangeUpper 110809 non-null float64 12 IncomeRange 110811 non-null category 13 IncomeVerifiable 110811 non-null bool 14 StatedMonthlyIncome 110811 non-null float64 15 DebtToIncomeRatio 102368 non-null float64 16 EmploymentStatus 110811 non-null object 17 EmploymentStatusDuration 105441 non-null float64 18 IsBorrowerHomeowner 110811 non-null bool 19 CreditRiskScore 110680 non-null category dtypes: bool(2), category(3), datetime64[ns](2), float64(8), int64(2), object(3) memory usage: 14.1+ MB
#get histogram for all numerical columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
prosper_reduced_data_numerical = prosper_reduced_data.select_dtypes(include=numerics)
prosper_reduced_data_numerical
| Term | LoanOriginalAmount | MonthlyLoanPayment | BorrowerAPR | BorrowerRate | CreditScoreRangeLower | CreditScoreRangeUpper | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatusDuration | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 9425 | 330.43 | 0.16516 | 0.1580 | 640.0 | 659.0 | 3083.333333 | 0.17 | 2.0 |
| 1 | 36 | 10000 | 318.93 | 0.12016 | 0.0920 | 680.0 | 699.0 | 6125.000000 | 0.18 | 44.0 |
| 2 | 36 | 3001 | 123.32 | 0.28269 | 0.2750 | 480.0 | 499.0 | 2083.333333 | 0.06 | NaN |
| 3 | 36 | 10000 | 321.45 | 0.12528 | 0.0974 | 800.0 | 819.0 | 2875.000000 | 0.15 | 113.0 |
| 4 | 36 | 15000 | 563.97 | 0.24614 | 0.2085 | 680.0 | 699.0 | 9583.333333 | 0.26 | 44.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | 36 | 10000 | 364.74 | 0.22354 | 0.1864 | 700.0 | 719.0 | 4333.333333 | 0.13 | 246.0 |
| 113933 | 36 | 2000 | 65.57 | 0.13220 | 0.1110 | 700.0 | 719.0 | 8041.666667 | 0.11 | 21.0 |
| 113934 | 60 | 10000 | 273.35 | 0.23984 | 0.2150 | 700.0 | 719.0 | 2875.000000 | 0.51 | 84.0 |
| 113935 | 60 | 15000 | 449.55 | 0.28408 | 0.2605 | 680.0 | 699.0 | 3875.000000 | 0.48 | 94.0 |
| 113936 | 36 | 2000 | 64.90 | 0.13189 | 0.1039 | 680.0 | 699.0 | 4583.333333 | 0.23 | 244.0 |
list(prosper_reduced_data_numerical.columns)
['Term', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'BorrowerAPR', 'BorrowerRate', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'StatedMonthlyIncome', 'DebtToIncomeRatio', 'EmploymentStatusDuration']
plt.rcParams['figure.figsize'] = (20, 10)
#histograms
prosper_reduced_data_numerical.hist(list(prosper_reduced_data_numerical.columns));
Note for above visualization:
LoanOriginalAmount, EmploymentstatusDuration, MonthlyLoanPayment: These columns are positively Skewed. We will dig in deeper to try different bins
CreditScoreRangelower, CreditScoreRangeUpper: They are normally distributed and look kind of similar
BorrowerRate, BoorrowerAPR: These columns have a similar distribution structure
#get Categorical Columns.
prosper_reduced_data_object = prosper_reduced_data.select_dtypes(include=['object', 'category'])
#exclude listingid
prosper_reduced_data_object = prosper_reduced_data_object.iloc[:, 1:]
prosper_reduced_data_object
| LoanStatus | ListingCategory | IncomeRange | EmploymentStatus | CreditRiskScore | |
|---|---|---|---|---|---|
| 0 | Completed | 0 | $25,000-49,999 | Self-employed | C |
| 1 | Current | 2 | $50,000-74,999 | Employed | A |
| 2 | Completed | 0 | Not displayed | Not available | HR |
| 3 | Current | 16 | $25,000-49,999 | Employed | A |
| 4 | Current | 2 | $100,000+ | Employed | D |
| ... | ... | ... | ... | ... | ... |
| 113932 | Current | 1 | $50,000-74,999 | Employed | C |
| 113933 | FinalPaymentInProgress | 7 | $75,000-99,999 | Employed | A |
| 113934 | Current | 1 | $25,000-49,999 | Employed | D |
| 113935 | Completed | 2 | $25,000-49,999 | Full-time | C |
| 113936 | Current | 1 | $50,000-74,999 | Employed | A |
#plot countplots
def countplotfunc(df, col):
plt.rcParams["figure.figsize"] = (20,7)
counts = df[col].value_counts()
status_order = counts.index
max_count = counts[0]
max_prop = max_count / df.shape[0]
tick_props = np.arange(0, max_prop, 0.1)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
title = col + ' Value Distribution'
ax = sns.countplot(data = df, x = col, order = status_order, color = 'blue').set(title = title)
plt.xticks(rotation = 45)
# values must be ordered from the largest for this to work
for i in range(counts.shape[0]):
count = counts[i]
pct_string = '{:0.1f}%'.format(100 * count / df.shape[0])
plt.text(i, count+10, pct_string, ha = 'center')
plt.tight_layout();
for col in list(prosper_reduced_data_object):
countplotfunc(prosper_reduced_data_object, col)
plt.show()
Notes:
CreditRiskScore: Most borrowers are in category C followed by B and least is NC
Employment Status: Most borrowers are employed(60.1%).
IncomeRange: Borrowers income between $25000 - $75000 (56% of customer Base)
LoanStatus: Half of borrowers are currently paying for their loans well followed by completed at 33%
ListingCategory: 50% of borrower's are for Debt Consolidation followed by Not available
# Boolean columns
prosper_boolean = prosper_reduced_data.select_dtypes(include=bool)
def piecharts(df, col):
df[col].value_counts().plot(kind = 'pie', autopct='%1.0f%%');
plt.show()
for col in list(prosper_boolean.columns):
piecharts(prosper_boolean, col)
Note:
IsBorrowerHomeowner: Very Slightly more customers have homes compared to those without homes
Incomeverfiable: 92% of borrowers are verifiable
sns.distplot(prosper_reduced_data['MonthlyLoanPayment']);
prosper_reduced_data['MonthlyLoanPayment'].describe()
count 110811.000000 mean 274.236988 std 192.719850 min 0.000000 25% 134.020000 50% 220.730000 75% 372.710000 max 2251.510000 Name: MonthlyLoanPayment, dtype: float64
This column is positively skewed with a log tail and presence of outliers in monthly loan Payment.
We will look at these two variables together to compare the gap between them.
plt.figure(figsize=(13,4))
plt.suptitle("Credit Score Range")
plt.subplot(1,2,1)
bins_lower = np.arange(0, df.CreditScoreRangeLower.max()+20, 20)
plt.hist(data = prosper_reduced_data, x = 'CreditScoreRangeLower', bins = bins_lower)
plt.xlabel('CreditScoreRangeLower')
plt.subplot(1,2,2)
bins_upper = np.arange(0, df.CreditScoreRangeUpper.max()+20, 20)
plt.hist(data = prosper_reduced_data, x = 'CreditScoreRangeUpper', bins = bins_upper)
plt.xlabel('CreditScoreRangeUpper');
prosper_reduced_data[['CreditScoreRangeLower', 'CreditScoreRangeUpper']].describe()
| CreditScoreRangeLower | CreditScoreRangeUpper | |
|---|---|---|
| count | 110809.000000 | 110809.000000 |
| mean | 686.846917 | 705.846917 |
| std | 64.549893 | 64.549893 |
| min | 0.000000 | 19.000000 |
| 25% | 660.000000 | 679.000000 |
| 50% | 680.000000 | 699.000000 |
| 75% | 720.000000 | 739.000000 |
| max | 880.000000 | 899.000000 |
The two distributions are very similar and this is because the range buckets are not large (they are of size 20) and here are all the existing buckets listed. I will keep CreditScoreRangeLower and drop the other
prosper_reduced_data.drop('CreditScoreRangeUpper', axis = 1, inplace = True)
prosper_reduced_data.rename(columns = {'CreditScoreRangeLower': 'CreditScoreRange'})
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditScoreRange | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263 | 2009-08-14 | Completed | 36 | 9425 | 330.43 | 0 | 0.16516 | 0.1580 | 640.0 | $25,000-49,999 | True | 3083.333333 | 0.17 | Self-employed | 2.0 | True | C |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900 | NaT | Current | 36 | 10000 | 318.93 | 2 | 0.12016 | 0.0920 | 680.0 | $50,000-74,999 | True | 6125.000000 | 0.18 | Employed | 44.0 | False | A |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090 | 2009-12-17 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | 480.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False | HR |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010 | NaT | Current | 36 | 10000 | 321.45 | 16 | 0.12528 | 0.0974 | 800.0 | $25,000-49,999 | True | 2875.000000 | 0.15 | Employed | 113.0 | True | A |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097 | NaT | Current | 36 | 15000 | 563.97 | 2 | 0.24614 | 0.2085 | 680.0 | $100,000+ | True | 9583.333333 | 0.26 | Employed | 44.0 | True | D |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 2013-04-14 05:55:02.663 | NaT | Current | 36 | 10000 | 364.74 | 1 | 0.22354 | 0.1864 | 700.0 | $50,000-74,999 | True | 4333.333333 | 0.13 | Employed | 246.0 | True | C |
| 113933 | E6DB353036033497292EE43 | 2011-11-03 20:42:55.333 | NaT | FinalPaymentInProgress | 36 | 2000 | 65.57 | 7 | 0.13220 | 0.1110 | 700.0 | $75,000-99,999 | True | 8041.666667 | 0.11 | Employed | 21.0 | True | A |
| 113934 | E6E13596170052029692BB1 | 2013-12-13 05:49:12.703 | NaT | Current | 60 | 10000 | 273.35 | 1 | 0.23984 | 0.2150 | 700.0 | $25,000-49,999 | True | 2875.000000 | 0.51 | Employed | 84.0 | True | D |
| 113935 | E6EB3531504622671970D9E | 2011-11-14 13:18:26.597 | 2013-08-13 | Completed | 60 | 15000 | 449.55 | 2 | 0.28408 | 0.2605 | 680.0 | $25,000-49,999 | True | 3875.000000 | 0.48 | Full-time | 94.0 | True | C |
| 113936 | E6ED3600409833199F711B7 | 2014-01-15 09:27:37.657 | NaT | Current | 36 | 2000 | 64.90 | 1 | 0.13189 | 0.1039 | 680.0 | $50,000-74,999 | True | 4583.333333 | 0.23 | Employed | 244.0 | False | A |
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 110811 entries, 0 to 113936 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 110811 non-null object 1 ListingCreationDate 110811 non-null datetime64[ns] 2 ClosedDate 52821 non-null datetime64[ns] 3 LoanStatus 110811 non-null object 4 Term 110811 non-null int64 5 LoanOriginalAmount 110811 non-null int64 6 MonthlyLoanPayment 110811 non-null float64 7 ListingCategory 110811 non-null category 8 BorrowerAPR 110811 non-null float64 9 BorrowerRate 110811 non-null float64 10 CreditScoreRangeLower 110809 non-null float64 11 IncomeRange 110811 non-null category 12 IncomeVerifiable 110811 non-null bool 13 StatedMonthlyIncome 110811 non-null float64 14 DebtToIncomeRatio 102368 non-null float64 15 EmploymentStatus 110811 non-null object 16 EmploymentStatusDuration 105441 non-null float64 17 IsBorrowerHomeowner 110811 non-null bool 18 CreditRiskScore 110680 non-null category dtypes: bool(2), category(3), datetime64[ns](2), float64(7), int64(2), object(3) memory usage: 13.2+ MB
prosper_reduced_data
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditScoreRangeLower | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263 | 2009-08-14 | Completed | 36 | 9425 | 330.43 | 0 | 0.16516 | 0.1580 | 640.0 | $25,000-49,999 | True | 3083.333333 | 0.17 | Self-employed | 2.0 | True | C |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900 | NaT | Current | 36 | 10000 | 318.93 | 2 | 0.12016 | 0.0920 | 680.0 | $50,000-74,999 | True | 6125.000000 | 0.18 | Employed | 44.0 | False | A |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090 | 2009-12-17 | Completed | 36 | 3001 | 123.32 | 0 | 0.28269 | 0.2750 | 480.0 | Not displayed | True | 2083.333333 | 0.06 | Not available | NaN | False | HR |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010 | NaT | Current | 36 | 10000 | 321.45 | 16 | 0.12528 | 0.0974 | 800.0 | $25,000-49,999 | True | 2875.000000 | 0.15 | Employed | 113.0 | True | A |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097 | NaT | Current | 36 | 15000 | 563.97 | 2 | 0.24614 | 0.2085 | 680.0 | $100,000+ | True | 9583.333333 | 0.26 | Employed | 44.0 | True | D |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 2013-04-14 05:55:02.663 | NaT | Current | 36 | 10000 | 364.74 | 1 | 0.22354 | 0.1864 | 700.0 | $50,000-74,999 | True | 4333.333333 | 0.13 | Employed | 246.0 | True | C |
| 113933 | E6DB353036033497292EE43 | 2011-11-03 20:42:55.333 | NaT | FinalPaymentInProgress | 36 | 2000 | 65.57 | 7 | 0.13220 | 0.1110 | 700.0 | $75,000-99,999 | True | 8041.666667 | 0.11 | Employed | 21.0 | True | A |
| 113934 | E6E13596170052029692BB1 | 2013-12-13 05:49:12.703 | NaT | Current | 60 | 10000 | 273.35 | 1 | 0.23984 | 0.2150 | 700.0 | $25,000-49,999 | True | 2875.000000 | 0.51 | Employed | 84.0 | True | D |
| 113935 | E6EB3531504622671970D9E | 2011-11-14 13:18:26.597 | 2013-08-13 | Completed | 60 | 15000 | 449.55 | 2 | 0.28408 | 0.2605 | 680.0 | $25,000-49,999 | True | 3875.000000 | 0.48 | Full-time | 94.0 | True | C |
| 113936 | E6ED3600409833199F711B7 | 2014-01-15 09:27:37.657 | NaT | Current | 36 | 2000 | 64.90 | 1 | 0.13189 | 0.1039 | 680.0 | $50,000-74,999 | True | 4583.333333 | 0.23 | Employed | 244.0 | False | A |
I will use these columns to get if a Loan is active or not
prosper_reduced_data.ListingCreationDate.describe()
count 110811 unique 110809 top 2012-10-20 12:21:46.333000 freq 2 first 2006-02-15 14:21:25.347000 last 2014-03-10 12:20:53.760000 Name: ListingCreationDate, dtype: object
prosper_reduced_data.groupby(prosper_reduced_data.ListingCreationDate.dt.year)['ListingKey'].count().plot(kind = 'bar');
plt.ylabel('Number of loans');
# 2009 was a critical year because of the financial crisis
prosper_reduced_data.ClosedDate.describe()
count 52821 unique 2682 top 2014-03-04 00:00:00 freq 105 first 2006-05-10 00:00:00 last 2014-03-10 00:00:00 Name: ClosedDate, dtype: object
#we have 57990 accounts still open
prosper_reduced_data.ClosedDate.isnull().sum()
57990
# months between loan origination and closing date
prosper_reduced_data['MonthsToClosed'] = ((prosper_reduced_data.ClosedDate - prosper_reduced_data.ListingCreationDate) / np.timedelta64(1, 'M'))
prosper_reduced_data['MonthsToClosed'].describe()
count 52821.000000 mean 19.656067 std 11.684394 min -45.421942 25% 9.645909 50% 17.455657 75% 31.265634 max 66.324018 Name: MonthsToClosed, dtype: float64
# we have accounts which have months to close as negative which is errornous
prosper_reduced_data[prosper_reduced_data['MonthsToClosed'] < 0]
| ListingKey | ListingCreationDate | ClosedDate | LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | BorrowerRate | CreditScoreRangeLower | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | MonthsToClosed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 108297 | DEAA359893047281162F432 | 2013-12-27 12:02:50.107 | 2010-03-16 | Defaulted | 36 | 2500 | 82.31 | 1 | 0.14206 | 0.1139 | 660.0 | $75,000-99,999 | False | 6250.0 | NaN | Self-employed | 7.0 | False | A | -45.421942 |
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 110811 entries, 0 to 113936 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 110811 non-null object 1 ListingCreationDate 110811 non-null datetime64[ns] 2 ClosedDate 52821 non-null datetime64[ns] 3 LoanStatus 110811 non-null object 4 Term 110811 non-null int64 5 LoanOriginalAmount 110811 non-null int64 6 MonthlyLoanPayment 110811 non-null float64 7 ListingCategory 110811 non-null category 8 BorrowerAPR 110811 non-null float64 9 BorrowerRate 110811 non-null float64 10 CreditScoreRangeLower 110809 non-null float64 11 IncomeRange 110811 non-null category 12 IncomeVerifiable 110811 non-null bool 13 StatedMonthlyIncome 110811 non-null float64 14 DebtToIncomeRatio 102368 non-null float64 15 EmploymentStatus 110811 non-null object 16 EmploymentStatusDuration 105441 non-null float64 17 IsBorrowerHomeowner 110811 non-null bool 18 CreditRiskScore 110680 non-null category 19 MonthsToClosed 52821 non-null float64 dtypes: bool(2), category(3), datetime64[ns](2), float64(8), int64(2), object(3) memory usage: 14.1+ MB
# This account creation Date is less than Closed Date. We will drop it from our analysis
prosper_reduced_data_check = prosper_reduced_data[prosper_reduced_data['MonthsToClosed'] > 0]
prosper_reduced_data['MonthsToClosed'].describe()
count 52821.000000 mean 19.656067 std 11.684394 min -45.421942 25% 9.645909 50% 17.455657 75% 31.265634 max 66.324018 Name: MonthsToClosed, dtype: float64
prosper_reduced_data.columns
Index(['ListingKey', 'ListingCreationDate', 'ClosedDate', 'LoanStatus', 'Term',
'LoanOriginalAmount', 'MonthlyLoanPayment', 'ListingCategory',
'BorrowerAPR', 'BorrowerRate', 'CreditScoreRangeLower', 'IncomeRange',
'IncomeVerifiable', 'StatedMonthlyIncome', 'DebtToIncomeRatio',
'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner',
'CreditRiskScore', 'MonthsToClosed'],
dtype='object')
#dropping date columns, interestRate and listingkey since we won't need them further
prosper_reduced_data.drop(columns = ['ListingKey','BorrowerRate', 'ListingCreationDate', 'ClosedDate'], inplace = True)
prosper_reduced_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 110811 entries, 0 to 113936 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LoanStatus 110811 non-null object 1 Term 110811 non-null int64 2 LoanOriginalAmount 110811 non-null int64 3 MonthlyLoanPayment 110811 non-null float64 4 ListingCategory 110811 non-null category 5 BorrowerAPR 110811 non-null float64 6 CreditScoreRangeLower 110809 non-null float64 7 IncomeRange 110811 non-null category 8 IncomeVerifiable 110811 non-null bool 9 StatedMonthlyIncome 110811 non-null float64 10 DebtToIncomeRatio 102368 non-null float64 11 EmploymentStatus 110811 non-null object 12 EmploymentStatusDuration 105441 non-null float64 13 IsBorrowerHomeowner 110811 non-null bool 14 CreditRiskScore 110680 non-null category 15 MonthsToClosed 52821 non-null float64 dtypes: bool(2), category(3), float64(7), int64(2), object(2) memory usage: 10.7+ MB
prosper_reduced_data_samp = prosper_reduced_data.sample(frac = 0.3, random_state = 1)
prosper_reduced_data_samp_numerical = prosper_reduced_data_samp.select_dtypes(include=numerics)
sns.pairplot(prosper_reduced_data_samp_numerical);
#Correlation matrix
sns.heatmap(prosper_reduced_data_samp_numerical.corr(), annot = True);
Based on our dependant variable, BorrowerAPR, we can see significant correlations with the following variables:
Negative Correlation with LoanOriginalAmount, MonthlyLoanPayment,CreditScoreRangeLower
Additionally, we can see positive correlations in LoanOriginalAmount with MonthlyLoanPayment, CreditScoreRangeLower, StatedMonthlyIncome
def ScatterPlots(df, col):
sns.regplot(data = df, x = df[col], y = df.BorrowerAPR, scatter_kws={'alpha':0.3}, y_jitter = 0.2)
plt.show()
# With Outliers
correlated_cols = ['LoanOriginalAmount', 'MonthlyLoanPayment','CreditScoreRangeLower']
for col in correlated_cols:
ScatterPlots(prosper_reduced_data_samp_numerical, col)
#Capping outliers so as to get a clear view of APR vs Correlated variables
def cap_data(df):
for col in df.columns:
print("capping the ",col)
if (((df[col].dtype)=='float64') | ((df[col].dtype)=='int64')):
percentiles = df[col].quantile([0.01,0.99]).values
df[col][df[col] <= percentiles[0]] = percentiles[0]
df[col][df[col] >= percentiles[1]] = percentiles[1]
else:
df[col]=df[col]
return df
final_df=cap_data(prosper_reduced_data_samp_numerical)
capping the Term capping the LoanOriginalAmount capping the MonthlyLoanPayment capping the BorrowerAPR capping the CreditScoreRangeLower capping the StatedMonthlyIncome capping the DebtToIncomeRatio capping the EmploymentStatusDuration capping the MonthsToClosed
for col in correlated_cols:
ScatterPlots(final_df, col)
The negative correlations of BorrowerAPR and loan original amount, monthly loan payment and credit score range lower are somewhat visible from the scatter plots and they all make logical sense.
Let's check the same with relationship between Correlated columns with Loan Original Amount
def ScatterPlots(df, col, y_axis):
sns.regplot(data = df, x = df[col], y = df[y_axis], scatter_kws={'alpha':0.3}, y_jitter = 0.2)
plt.show()
correlated_cols = ['MonthlyLoanPayment', 'CreditScoreRangeLower', 'StatedMonthlyIncome']
for col in correlated_cols:
ScatterPlots(final_df, col, 'LoanOriginalAmount')
With the above selected variable, there is positive correlation with Loan original Amount
prosper_reduced_data_samp
| LoanStatus | Term | LoanOriginalAmount | MonthlyLoanPayment | ListingCategory | BorrowerAPR | CreditScoreRangeLower | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | DebtToIncomeRatio | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CreditRiskScore | MonthsToClosed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84581 | Current | 36 | 20000 | 703.04 | 1 | 0.19645 | 740.0 | $25,000-49,999 | True | 2916.666667 | 0.78 | Employed | 414.0 | False | B | NaN |
| 75489 | Current | 60 | 10000 | 253.66 | 1 | 0.20347 | 700.0 | $50,000-74,999 | True | 5884.083333 | 0.36 | Employed | 388.0 | True | C | NaN |
| 112937 | Completed | 36 | 1200 | 45.44 | 1 | 0.23635 | 580.0 | $25,000-49,999 | True | 4041.666667 | 0.14 | Full-time | 2.0 | True | E | 10.817918 |
| 97928 | Current | 36 | 7550 | 249.08 | 1 | 0.14348 | 760.0 | $50,000-74,999 | True | 5666.666667 | 0.10 | Employed | 2.0 | True | A | NaN |
| 15126 | Defaulted | 36 | 5000 | 173.33 | 0 | 0.15713 | 640.0 | $25,000-49,999 | True | 3880.000000 | 0.07 | Full-time | 69.0 | False | C | 28.391151 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36372 | Current | 60 | 9000 | 232.23 | 1 | 0.21192 | 660.0 | $50,000-74,999 | True | 4583.333333 | 0.08 | Employed | 1.0 | False | C | NaN |
| 22562 | Completed | 36 | 8500 | 275.63 | 8 | 0.13138 | 760.0 | $75,000-99,999 | True | 6250.000000 | 0.24 | Employed | 147.0 | True | A | 8.628991 |
| 108827 | Current | 36 | 4000 | 135.84 | 1 | 0.17151 | 680.0 | $25,000-49,999 | True | 2500.000000 | 0.22 | Employed | 39.0 | False | B | NaN |
| 102049 | Current | 36 | 10000 | 400.35 | 1 | 0.29394 | 680.0 | $50,000-74,999 | False | 5000.000000 | NaN | Employed | 121.0 | False | D | NaN |
| 56149 | FinalPaymentInProgress | 36 | 3000 | 103.54 | 1 | 0.18316 | 720.0 | $50,000-74,999 | True | 4166.666667 | 0.20 | Full-time | 94.0 | True | B | NaN |
# plot matrix of numeric features against categorical features.
def boxgrid(x, y, **kwargs):
""" Quick hack for creating box plots with seaborn's PairGrid. """
default_color = sns.color_palette()[0]
sns.boxplot(x, y, color = default_color)
plt.figure(figsize = [12, 20])
g = sns.PairGrid(data = prosper_reduced_data_samp, x_vars = ['BorrowerAPR', 'StatedMonthlyIncome', 'LoanOriginalAmount'],
y_vars = prosper_reduced_data_object.columns, size = 3, aspect = 1.5)
g.map(boxgrid);
plt.xticks(rotation=45);
<Figure size 864x1440 with 0 Axes>
For the above boxplot we can see that;
g = sns.FacetGrid(data = prosper_reduced_data_samp, col = 'IncomeVerifiable')
g.map(sns.distplot, 'BorrowerAPR');
g = sns.FacetGrid(data = prosper_reduced_data_samp, col = 'IsBorrowerHomeowner')
g.map(sns.distplot, 'BorrowerAPR');
Note;
The borrower APR is negatively associated with the loan original amount, which mean the more the loan amount, the lower the APR. It also shows that at different size of the loan amount, the APR has a large range, but the range of APR decrease with the increase of loan amount. The Prosper rating also has a strong effect on the borrower APR, which decreases with the better rating.
The loan original amount is positively correlated with the stated monthly income, it makes sense since borrowers with more monthly income could loan more money. It also shows that borrowers with better rating also have larger monthly income and loan amount. retired borowers tend to have a low APR and get Low Original Loan amount which is against the norm of High APR Score = Low LoanOriginalAmount. Also, There is much difference between homeowner's APR with non-homeowners
This section will build on previous findings and combine multiple variables to see how they interact. A few sections will follow, each focusing on one variable and its relationship with others.
prosper_reduced_data_samp_numerical.columns
Index(['Term', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'BorrowerAPR',
'CreditScoreRangeLower', 'StatedMonthlyIncome', 'DebtToIncomeRatio',
'EmploymentStatusDuration', 'MonthsToClosed'],
dtype='object')
plt.figure(figsize=(13,4))
<Figure size 936x288 with 0 Axes>
<Figure size 936x288 with 0 Axes>
# Pair plot colored by credit risk
sns.pairplot(prosper_reduced_data_samp,
vars = prosper_reduced_data_samp_numerical.columns,
hue = 'CreditRiskScore', diag_kind = 'kde', palette = 'RdYlBu_r',
plot_kws = {'alpha': 0.6, 's': 90},
size = 4);
We can see separation between the different creditScoreGroups
plt.figure(figsize=(16,15))
g = sns.FacetGrid(data = prosper_reduced_data_samp, col = 'CreditRiskScore', hue = 'Term');
g.map(sns.scatterplot, 'LoanOriginalAmount', 'BorrowerAPR');
g.add_legend()
plt.show();
<Figure size 1152x1080 with 0 Axes>
In all Most CreditRiskGroups, the APR tends to be higher in 36 months compared to 60 Months
plt.figure(figsize=(16,15))
g = sns.FacetGrid(data = prosper_reduced_data_samp, col = 'CreditRiskScore', hue = 'Term');
g.map(sns.scatterplot, 'DebtToIncomeRatio' , 'BorrowerAPR');
g.add_legend()
plt.show();
<Figure size 1152x1080 with 0 Axes>
In most CreditRiskGroups, the lower your debttoincomeRatio the lower the borrower's APR.
sns.catplot(x = 'EmploymentStatus', y = 'BorrowerAPR', hue = 'CreditRiskScore', order = \
['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
kind = 'box', data = prosper_reduced_data_samp, size = 15, palette = 'RdYlBu_r');
sns.catplot(x = 'EmploymentStatus', y = 'BorrowerAPR', hue = 'IsBorrowerHomeowner', order = \
['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
kind = 'box', data = prosper_reduced_data_samp, size = 15, palette = 'colorblind');
Differences among employment statuses are not very significant or consistent. It makes it difficult to make any conclusion
fig = plt.figure(figsize = [8,6])
ax = sns.pointplot(data = prosper_reduced_data_samp, x = 'c', y = 'BorrowerAPR', hue = 'Term',
palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.title('Borrower APR across rating and term')
plt.ylabel('Mean Borrower APR')
ax.set_yticklabels([],minor = True);
It is interesting that for borrowers with creditScore category AA and A, there APR tend to increase with increased term length compared to higher risk borrowers D and E where 12 months term tend to have highest APR and it reduces for the other groups
sns.pointplot(data = prosper_reduced_data_samp, x = 'CreditRiskScore', y = 'LoanOriginalAmount', hue = 'Term',
palette = 'Greens', linestyles = '', dodge = 0.4);
there is a interaction between term and rating. We can see that with better Credit Score rating, the loan amount of all three terms increases, the increase amplitude of loan amount between terms also becomes larger.
Credit risk category, as expected and hoped, is a major factor which segments the borrowers into clearly distinguishable groups. Being assigned into one of the risk categories strongly determines the interest rate and the borrowed amount.
There wasn't much between homeowners and non-homeowners between any employment groups